23 research outputs found

    EEG data analysis and the development of data partitions for machine learning algorithms

    The electronic version of the thesis does not contain the publications. The thesis develops a novel, more efficient data handling method for machine learning. In classical statistics, models are simple enough that, together with some assumptions about the data, they can tell whether a result is statistically significant, i.e. whether the data contain any signal distinguishable from noise. Machine learning algorithms, such as deep neural networks, can have hundreds of millions of weights, which changes the rules of the game: such models can explain any data with 100% accuracy regardless of whether a signal is present, a phenomenon known in machine learning as overfitting. Statistical significance is therefore assessed differently: some data points are set aside and not used in the model fitting phase, and once the best model has been found, its quality is evaluated on that held-out test set. This method works well, but some of the precious data is spent on testing the model rather than used in training. Researchers have proposed many solutions to improve the efficiency of data usage, but each has its own drawbacks. One of the main methods, nested cross-validation, uses data very efficiently but makes it very difficult to interpret the model parameters; conversely, setting data aside keeps the parameters interpretable but leaves the model itself less efficient. In this thesis, we invented a novel approach for data partitioning that we termed "cross-validation and cross-testing". First, cross-validation is used on part of the data to determine and lock the model parameters. Then the model is tested on the separate test set in a novel way: testing is performed in several cycles, and in each cycle the part of the test set not being evaluated is used again in a model training phase.
    This gives us an improved system for using machine learning algorithms in cases where we need to interpret the model parameters but not the model weights. For example, it makes it possible to report that the data have a linear rather than a quadratic relationship, or that the best neural network has 5 hidden layers. In the natural sciences, with their complex data, this is often exactly what is needed to state at the end of a paper which model was best. To our knowledge, the new method is currently the most efficient available for this situation.
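The partitioning scheme described above can be sketched in code. The following is a minimal illustration, not the thesis implementation: the toy dataset, the k-nearest-neighbour model, and the fold counts are all assumptions chosen to keep the sketch self-contained.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, k):
    """Brute-force k-nearest-neighbour majority vote (binary labels)."""
    d = ((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_tr[nearest].mean(axis=1) > 0.5).astype(int)

def folds(n, n_splits, rng):
    """Shuffle indices 0..n-1 and split them into n_splits folds."""
    return np.array_split(rng.permutation(n), n_splits)

rng = np.random.RandomState(0)
n = 200
y = np.repeat([0, 1], n // 2)
X = rng.randn(n, 5) + y[:, None] * 1.5   # separable two-class toy data

# Step 1: set aside a test set before any model selection.
order = rng.permutation(n)
test, cv = order[:n // 2], order[n // 2:]
X_cv, y_cv, X_test, y_test = X[cv], y[cv], X[test], y[test]

# Step 2: plain cross-validation on the CV set to choose and lock
# the interpretable parameter (here: k).
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7):
    accs = []
    for fold in folds(len(cv), 5, rng):
        tr = np.setdiff1d(np.arange(len(cv)), fold)
        pred = knn_predict(X_cv[tr], y_cv[tr], X_cv[fold], k)
        accs.append((pred == y_cv[fold]).mean())
    if np.mean(accs) > best_acc:
        best_k, best_acc = k, float(np.mean(accs))

# Step 3: cross-testing -- score each test fold with a model retrained on
# the CV set plus the *other* test folds, so no held-out data is wasted.
scores = []
for fold in folds(len(test), 5, rng):
    rest = np.setdiff1d(np.arange(len(test)), fold)
    X_tr = np.vstack([X_cv, X_test[rest]])
    y_tr = np.concatenate([y_cv, y_test[rest]])
    pred = knn_predict(X_tr, y_tr, X_test[fold], best_k)
    scores.append((pred == y_test[fold]).mean())

accuracy = float(np.mean(scores))
print(best_k, round(accuracy, 3))
```

The key point is that `best_k` is fixed once and can be reported, while every test point is still evaluated by a model trained on all remaining data.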

    Matemaatika õhtuõpik

    Copying and printing permitted. http://tartu.ester.ee/record=b2671323~S1*es

    Personality cannot be predicted from the power of resting state EEG

    In the present study, we asked whether it is possible to decode personality traits from resting state EEG data. EEG was recorded from a large sample of subjects (N = 309) who had answered questionnaires measuring personality trait scores on the 5 dimensions of the Big Five as well as their 10 subordinate aspects. Machine learning algorithms were used to build a classifier to predict each personality trait from the power spectra of the resting state EEG data. The results indicate that neither the five dimensions nor their subordinate aspects could be predicted from the resting state EEG data. Finally, to demonstrate that this result is not due to systematic algorithmic or implementation mistakes, the same methods were used to successfully classify whether the subject had eyes open or closed and whether the subject was male or female. These results indicate that the extraction of personality traits from the power spectra of resting state EEG is extremely noisy, if possible at all. Comment: 14 pages, 4 figures

    Comparison of the approaches.


    Analysis of neuroscience data (left: EEG dataset; right: spiking dataset) with three different approaches as a function of the relative test set size.

    Results show the mean accuracy (upper graphs) and the proportion of significant results (bottom graphs) out of 1000 runs. The dataset size was fixed at 50 for the EEG data and at 100 for the spike train data. A larger test set leads to lower average accuracy because less data is available for choosing parameters and fitting a model. “Cross-validation and cross-testing” outperforms “cross-validation and testing” in terms of both average accuracy and the proportion of significant results.

    In the “Nested cross-validation” approach, first (outer) cross-validation is performed to estimate predictability of the data.

    In each iteration, the data are divided into training and test sets. Before training, another (inner) cross-validation loop is used to optimize the parameters. Because the model weights (fitted models) and parameters differ across partitions, it is not possible to report accuracy or statistical significance for any particular set of parameters or model weights.
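The nested cross-validation scheme this caption describes can be sketched briefly with scikit-learn; the model, parameter grid, and fold counts below are illustrative assumptions, not the study's actual setup.

```python
# Hedged sketch of nested cross-validation: an inner loop tunes a
# parameter on each training split, an outer loop estimates accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: tune the regularization parameter C on each training split.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {"C": [0.1, 1.0, 10.0]}, cv=5)

# Outer loop: estimate predictability. Each outer fold may lock a
# different C, which is why no single parameter set can be reported.
outer_scores = cross_val_score(inner, X, y, cv=5)
mean_accuracy = float(outer_scores.mean())
print(mean_accuracy)
```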

    Batches of simulated data of different sizes are analyzed 1000 times with three different approaches.

    Results show the mean accuracy (upper plot) and the proportion of significant results (bottom plot) out of 1000 runs. More data leads to higher average accuracy and a larger proportion of significant results. “Nested cross-validation” outperforms the other approaches, while “cross-validation and testing” performs worst in terms of both average accuracy and the proportion of significant results.

    In the “Cross-validation and testing” approach, the data are divided into two separate sets (cross-validation set and test set) only once.

    First, different models are trained and validated with cross-validation and the best set of parameters is chosen. Prediction accuracy and the statistical significance of the parameters are then evaluated on the test set, after training on the cross-validation set.
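This baseline "cross-validation and testing" approach can be sketched as follows; the dataset, model, and grid are illustrative assumptions chosen to keep the sketch runnable.

```python
# Hedged sketch of "cross-validation and testing": one fixed split,
# parameter selection on the CV half, a single final evaluation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The data are divided into a cross-validation set and a test set once.
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=0.5,
                                              random_state=0)

# Choose the best parameters by cross-validation on the CV set only;
# refit=True (the default) then retrains on the whole CV set.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_cv, y_cv)

# A single evaluation of the locked model on the untouched test set.
test_accuracy = float(search.best_estimator_.score(X_test, y_test))
print(search.best_params_, round(test_accuracy, 3))
```

The test set is touched exactly once, which keeps the chosen parameters interpretable but, as the simulations above show, wastes the held-out half of the data during training.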

    Analysis of real data (left: EEG dataset; right: spiking dataset) with three different approaches as a function of data size with test set size fixed at 50%.

    Results show the mean accuracy (upper graphs) and the proportion of significant results (bottom graphs) out of 1000 runs. More data leads to higher average accuracy and a larger proportion of significant results. “Nested cross-validation” outperforms the other approaches, while “cross-validation and testing” performs worst in terms of both average accuracy and the proportion of significant results. The effect is smaller for the EEG data, suggesting that more efficient use of data during model fitting matters less there and that the choice of parameters is the main driver.